feat: Add NLP techniques and GUI to Social Media Analyzer #22
Conversation
This commit introduces two major enhancements to the Social Media Analyzer:

- Integration of NLP techniques for more sophisticated scam and fake news detection.
- A new web-based GUI built with React to replace the command-line interface.

Backend changes:

- Added `nltk` and `textblob` for NLP tasks.
- Integrated sentiment analysis into the scam detector to identify messages with strong negative sentiment (a sketch of this check follows below).
- Enhanced the fake news detector with Named Entity Recognition (NER) to identify organizations and people mentioned in articles.
- Created a Flask API to expose the analyzer's functionality to the frontend.

Frontend changes:

- Created a new React application with components for:
  - Scam Analyzer
  - Fake News Analyzer
- The GUI allows users to analyze text and URLs in a user-friendly interface.
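For context on the sentiment check mentioned above, here is a minimal sketch. It uses TextBlob's `sentiment.polarity` score (which ranges from -1.0 to 1.0); the -0.5 threshold and the weight constant are illustrative assumptions, not values taken from this PR.

```python
from textblob import TextBlob

# Assumed weight; the actual HEURISTIC_WEIGHTS value is not shown in this PR.
SENTIMENT_WEIGHT = 1.0

def check_sentiment(text, score, indicators_found):
    """Flag text whose overall polarity is strongly negative.

    TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive);
    the -0.5 cutoff here is an illustrative assumption.
    """
    polarity = TextBlob(text).sentiment.polarity
    if polarity < -0.5:
        score += SENTIMENT_WEIGHT
        indicators_found.append("Strong negative sentiment detected in text.")
    return score
```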
Reviewer's Guide

This pull request augments the Social Media Analyzer by transforming the CLI into a Flask-based API, integrating NLP techniques in both scam and fake-news detectors, and layering a React GUI on top for interactive analysis.

Class diagram for updated scam and fake news detectors

```mermaid
classDiagram
    class ScamDetector {
        +analyze_text_for_scams(text_content, platform, api_key)
        -Sentiment Analysis (TextBlob)
        -Keyword-based checks
        -Regex-based checks
        -Financial Identifiers
        -Phone Numbers
        score: float
        indicators_found: list
        urls_analyzed_details: list
    }
    class FakeNewsDetector {
        +analyze_url_for_fake_news(url)
        -Named Entity Recognition (NLTK)
        -Fake news domain check
        -Clickbait pattern check
        score: float
        indicators_found: list
        named_entities: dict
    }
    ScamDetector <.. FlaskAPI
    FakeNewsDetector <.. FlaskAPI
```
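The diagram's FlaskAPI dependency presumably corresponds to route handlers along these lines. This is a sketch: the import paths, payload field names, and JSON shapes are assumptions inferred from the repo layout and the endpoint names in the Sourcery summary.

```python
from flask import Flask, jsonify, request

# Import paths inferred from the repo layout shown in the review comments.
from social_media_analyzer.scam_detector import analyze_text_for_scams
from social_media_analyzer.fake_news_detector import analyze_url_for_fake_news

app = Flask(__name__)

@app.route("/analyze/scam", methods=["POST"])
def analyze_scam():
    data = request.get_json(force=True)
    # Field names are assumptions; the PR does not show the payload schema.
    result = analyze_text_for_scams(
        data.get("text", ""),
        platform=data.get("platform"),
        api_key=data.get("api_key"),
    )
    return jsonify(result)

@app.route("/analyze/fake-news", methods=["POST"])
def analyze_fake_news():
    data = request.get_json(force=True)
    result = analyze_url_for_fake_news(data.get("url", ""))
    return jsonify(result)
```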
Security scan results:

- Insecure Configuration (1)
- Vulnerable Libraries (1)
Hey there - I've reviewed your changes and found some issues that need to be addressed.
Blocking issues:
- Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables. (link)
General comments:
- You should configure CORS (e.g., via flask-cors) on the Flask API so the React frontend can reliably call `/analyze/*` endpoints without cross-origin errors.
- Consider handling missing NLTK data and the GOOGLE_API_KEY at startup (or auto-downloading required corpora) to avoid runtime failures when the environment isn't preconfigured. Both fixes are sketched below.
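One way to address both comments at startup is sketched below. It configures CORS and verifies the NLTK data and the API key before serving requests; it assumes the flask-cors package and the corpora named in the `analyze_url_for_fake_news` docstring, and is a sketch rather than the PR's actual code.

```python
import os
import nltk
from flask import Flask
from flask_cors import CORS  # pip install flask-cors

app = Flask(__name__)
CORS(app)  # allow the React dev server to call /analyze/* cross-origin

# Corpora listed in analyze_url_for_fake_news's docstring.
REQUIRED_NLTK_DATA = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
    "maxent_ne_chunker": "chunkers/maxent_ne_chunker",
    "words": "corpora/words",
}

for package, path in REQUIRED_NLTK_DATA.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(package)  # fetch missing corpus instead of failing later

if not os.environ.get("GOOGLE_API_KEY"):
    # Fail fast with a clear message rather than erroring mid-request
    # (or log a warning instead, if the key is optional).
    raise RuntimeError("GOOGLE_API_KEY is not set; export it before starting the API.")
```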
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- You should configure CORS (e.g., via flask-cors) on the Flask API so the React frontend can reliably call `/analyze/*` endpoints without cross-origin errors.
- Consider handling missing NLTK data and the GOOGLE_API_KEY at startup (or auto-downloading required corpora) to avoid runtime failures when the environment isn't preconfigured.
## Individual Comments
### Comment 1
<location> `text_message_analyzer/app.py:36` </location>
<code_context>
-if __name__ == "__main__":
- app.run(host="0.0.0.0", port=8080)
+if __name__ == '__main__':
+ app.run(debug=True)
</code_context>
<issue_to_address>
**🚨 issue (security):** Enabling debug mode in production can expose sensitive information.
Ensure debug mode is disabled in production, ideally by controlling it with an environment variable.
</issue_to_address>
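A common fix, as a sketch: keep debug off by default and opt in through an environment variable. The `FLASK_DEBUG` name is a convention, and the host/port values are carried over from the pre-diff code, not something this PR specifies.

```python
import os

if __name__ == "__main__":
    # Debug stays off unless explicitly enabled via the environment.
    debug = os.environ.get("FLASK_DEBUG", "0") == "1"
    app.run(host="0.0.0.0", port=8080, debug=debug)
```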
### Comment 2
<location> `social_media_analyzer/fake_news_detector.py:29-32` </location>
<code_context>
score = 0.0
indicators_found = []
+ named_entities = {
+ "organizations": [],
+ "persons": [],
+ }
# 1. Check against known fake news domains
</code_context>
<issue_to_address>
**issue:** Named entity extraction does not handle cases where NLTK data is missing.
Catch NLTK exceptions and provide a user-friendly error message or guidance on downloading missing data.
</issue_to_address>
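As a sketch of the suggested guard: NLTK raises `LookupError` when a required corpus is missing, so the NER step can be wrapped to return guidance instead of a bare traceback (this assumes the surrounding function's variables).

```python
try:
    tokens = nltk.word_tokenize(text_content)
    tagged = nltk.pos_tag(tokens)
    entities = nltk.ne_chunk(tagged)
except LookupError as e:
    # Missing corpus (punkt, averaged_perceptron_tagger,
    # maxent_ne_chunker, or words).
    return {
        "error": (
            "NLTK data is missing; run nltk.download() for the required "
            f"corpora before using named entity extraction. Details: {e}"
        )
    }
```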
### Comment 3
<location> `text_message_analyzer/app.py:36` </location>
<code_context>
app.run(debug=True)
</code_context>
<issue_to_address>
**security (python.flask.security.audit.debug-enabled):** Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables.
*Source: opengrep*
</issue_to_address>
### Comment 4
<location> `social_media_analyzer/fake_news_detector.py:23` </location>
<code_context>
def analyze_url_for_fake_news(url):
"""
Analyzes a URL for indicators of fake news.
NOTE: This function requires the following NLTK data to be downloaded:
- 'punkt'
- 'averaged_perceptron_tagger'
- 'maxent_ne_chunker'
- 'words'
"""
if not url.startswith(('http://', 'https://')):
url = 'http://' + url
domain = urlparse(url).netloc.lower()
score = 0.0
indicators_found = []
named_entities = {
"organizations": [],
"persons": [],
}
# 1. Check against known fake news domains
if domain in FAKE_NEWS_DOMAINS:
score += HEURISTIC_WEIGHTS.get("KNOWN_FAKE_NEWS_DOMAIN", 5.0)
indicators_found.append(f"Domain '{domain}' is a known source of fake news.")
return {
"url": url,
"score": round(score, 2),
"indicators_found": indicators_found
}
# 2. Fetch and analyze content
try:
headers = {'User-Agent': 'Mozilla/5.0'}
request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request, timeout=10) as response:
if response.status == 200:
html_content = response.read().decode('utf-8', errors='ignore')
text_content = re.sub(r'<[^>]+>', '', html_content).lower()
# 3. Analyze text for sensationalist keywords
for keyword in SENSATIONALIST_KEYWORDS:
if keyword in text_content:
score += HEURISTIC_WEIGHTS.get("SENSATIONALIST_KEYWORD", 1.0)
indicators_found.append(f"Found sensationalist keyword: '{keyword}'")
# 4. Analyze text for clickbait patterns
for pattern in CLICKBAIT_PATTERNS:
if re.search(pattern, text_content, re.IGNORECASE):
score += HEURISTIC_WEIGHTS.get("CLICKBAIT_PATTERN", 1.5)
indicators_found.append(f"Found clickbait pattern: '{pattern}'")
# 5. Named Entity Recognition
tokens = nltk.word_tokenize(text_content)
tagged = nltk.pos_tag(tokens)
entities = nltk.ne_chunk(tagged)
for entity in entities:
if isinstance(entity, nltk.Tree):
entity_text = " ".join([word for word, tag in entity.leaves()])
if entity.label() == 'ORGANIZATION':
if entity_text not in named_entities["organizations"]:
named_entities["organizations"].append(entity_text)
elif entity.label() == 'PERSON':
if entity_text not in named_entities["persons"]:
named_entities["persons"].append(entity_text)
else:
return {"error": f"Failed to fetch URL: HTTP status code {response.status}"}
except Exception as e:
return {"error": f"An error occurred: {e}"}
return {
"url": url,
"score": round(score, 2),
"indicators_found": indicators_found,
"named_entities": named_entities
}
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Use f-string instead of string concatenation ([`use-fstring-for-concatenation`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/use-fstring-for-concatenation/))
- Swap if/else branches ([`swap-if-else-branches`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/swap-if-else-branches/))
- Remove unnecessary else after guard condition ([`remove-unnecessary-else`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/remove-unnecessary-else/))
</issue_to_address>
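Applied to the code above, the three refactorings might look like the following sketch of just the affected lines:

```python
# use-fstring-for-concatenation
if not url.startswith(("http://", "https://")):
    url = f"http://{url}"

# swap-if-else-branches + remove-unnecessary-else: guard on the failure
# case first, then continue with the success path unindented.
with urllib.request.urlopen(request, timeout=10) as response:
    if response.status != 200:
        return {"error": f"Failed to fetch URL: HTTP status code {response.status}"}
    html_content = response.read().decode("utf-8", errors="ignore")
    text_content = re.sub(r"<[^>]+>", "", html_content).lower()
    # ... keyword, clickbait, and NER analysis continue here ...
```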
### Comment 5
<location> `social_media_analyzer/test_scam_detector.py:10-15` </location>
<code_context>
def test_sentiment_analysis(self):
# Test case for negative sentiment
text_negative = "This is a terrible, awful, no good, very bad message."
result_negative = analyze_text_for_scams(text_negative)
self.assertIn("Strong negative sentiment detected in text.", [indicator for indicator in result_negative["indicators_found"]])
# Test case for positive sentiment
text_positive = "This is a wonderful, amazing, great message."
result_positive = analyze_text_for_scams(text_positive)
self.assertNotIn("Strong negative sentiment detected in text.", [indicator for indicator in result_positive["indicators_found"]])
</code_context>
<issue_to_address>
**issue (code-quality):** Replace identity comprehension with call to collection constructor [×2] ([`identity-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/identity-comprehension/))
<br/><details><summary>Explanation</summary>Convert list/set/tuple comprehensions that do not change the input elements into calls to the collection constructor.
#### Before
```python
# List comprehensions
[item for item in coll]
[item for item in friends.names()]
# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()} # Only if we know coll is a `dict`
# Unneeded call to `.items()`
dict(coll.items()) # Only if we know coll is a `dict`
# Set comprehensions
{item for item in coll}
```
#### After
```python
# List comprehensions
list(iter(coll))
list(iter(friends.names()))
# Dict comprehensions
dict(coll)
dict(coll)
# Unneeded call to `.items()`
dict(coll)
# Set comprehensions
set(coll)
```
All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.
</details>
</issue_to_address>
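Applied here, the fix is simply to drop the comprehension, since `indicators_found` is already a list:

```python
# indicators_found is already a list, so assert against it directly.
self.assertIn(
    "Strong negative sentiment detected in text.",
    result_negative["indicators_found"],
)
self.assertNotIn(
    "Strong negative sentiment detected in text.",
    result_positive["indicators_found"],
)
```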
### Comment 6
<location> `social_media_analyzer/test_scam_detector.py:19-26` </location>
<code_context>
def test_keyword_matching(self):
# Test case for urgency keyword
text_urgency = "URGENT: Your account has been compromised."
result_urgency = analyze_text_for_scams(text_urgency)
self.assertIn("Presence of 'Urgency' keyword: 'urgent'", [indicator for indicator in result_urgency["indicators_found"]])
# Test case for stemming
text_stemming = "I need you to verify your account immediately."
result_stemming = analyze_text_for_scams(text_stemming)
self.assertIn("Presence of 'Sensitive Info' keyword: 'verify your account'", [indicator for indicator in result_stemming["indicators_found"]])
</code_context>
<issue_to_address>
**issue (code-quality):** We've found these issues:
- Extract duplicate code into method ([`extract-duplicate-method`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/extract-duplicate-method/))
- Replace identity comprehension with call to collection constructor [×2] ([`identity-comprehension`](https://docs.sourcery.ai/Reference/Default-Rules/refactorings/identity-comprehension/))
<br/><details><summary>Explanation</summary>
Convert list/set/tuple comprehensions that do not change the input elements into calls to the collection constructor.
#### Before
```python
# List comprehensions
[item for item in coll]
[item for item in friends.names()]
# Dict comprehensions
{k: v for k, v in coll}
{k: v for k, v in coll.items()} # Only if we know coll is a `dict`
# Unneeded call to `.items()`
dict(coll.items()) # Only if we know coll is a `dict`
# Set comprehensions
{item for item in coll}
```
#### After
```python
# List comprehensions
list(iter(coll))
list(iter(friends.names()))
# Dict comprehensions
dict(coll)
dict(coll)
# Unneeded call to `.items()`
dict(coll)
# Set comprehensions
set(coll)
```
All these comprehensions are just creating a copy of the original collection.
They can all be simplified by simply constructing a new collection directly. The
resulting code is easier to read and shows the intent more clearly.
</details>
</issue_to_address>
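A sketch of the extract-duplicate-method suggestion applied to this test; the helper name `assert_has_indicator`, the class name, and the import path are invented for illustration.

```python
import unittest
from social_media_analyzer.scam_detector import analyze_text_for_scams

class TestScamDetector(unittest.TestCase):
    # Hypothetical helper extracted from the duplicated analyze-then-assert pattern.
    def assert_has_indicator(self, text, indicator):
        result = analyze_text_for_scams(text)
        self.assertIn(indicator, result["indicators_found"])

    def test_keyword_matching(self):
        # Urgency keyword
        self.assert_has_indicator(
            "URGENT: Your account has been compromised.",
            "Presence of 'Urgency' keyword: 'urgent'",
        )
        # Stemming ("verify" matched inside a longer phrase)
        self.assert_has_indicator(
            "I need you to verify your account immediately.",
            "Presence of 'Sensitive Info' keyword: 'verify your account'",
        )
```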
Summary by Sourcery

Add NLP-driven fraud and misinformation detection and provide a React-based web interface powered by a Flask API.

New Features:

- Expose `/analyze/scam` and `/analyze/fake-news` endpoints via a Flask API.
- Add a React GUI with Scam Analyzer and Fake News Analyzer components.

Enhancements:

- Integrate TextBlob sentiment analysis into the scam detector and NLTK Named Entity Recognition into the fake news detector.

Build:

- Add `nltk` and `textblob` as backend dependencies.

Tests:

- Add unit tests for sentiment analysis and keyword matching in the scam detector.